Abstract

We extend the Chow-Liu algorithm to general random variables, whereas previous versions considered only the finite case. In particular, this paper applies the generalization to Suzuki's learning algorithm, which generates forests rather than trees from data, based on the minimum description length, by balancing the fit of the data to the forest against the simplicity of the forest. As a result, we obtain an algorithm that applies when both Gaussian and finite random variables are present.

1 Introduction 

Learning statistical knowledge from data requires a large amount of computation. For example, constructing a Bayesian network structure expressed by a directed acyclic graph from data requires time exponential in the number of nodes (attributes). We eventually compromise between the accuracy and the time complexity of a learning algorithm by choosing an approximation to the best solution. Even in such situations, how to avoid overestimation should be considered. In this paper, we address how to efficiently estimate the dependency relation among attributes by constructing an undirected graph (a Markov network) via the Chow-Liu algorithm [2].

The original Chow-Liu algorithm approximates a probability distribution by a Dendroid distribution expressed by a tree, and obtains the best solution in the sense that the Kullback-Leibler information from the original distribution is the smallest. The algorithm utilizes the Kruskal algorithm (Aho, Hopcroft, and Ullman, 1974): starting with a finite set $V$ and weights $\{w_{ij}\}_{i,j \in V,\, i \neq j}$,

1. $E := \{\}$

2. $\mathcal{E} := \{\{i,j\} \mid i,j \in V,\ i \neq j\}$

3. $\mathcal{E} := \mathcal{E} \setminus \{\{i,j\}\}$ for $\{i,j\} \in \mathcal{E}$ maximizing $w_{ij}$

4. if $(V, E \cup \{\{i,j\}\})$ does not contain a loop, then $E := E \cup \{\{i,j\}\}$

5. if $\mathcal{E} \neq \{\}$, then go to 3; else terminate.

As a result, a tree $(V,E)$ with the maximum value of $\sum_{\{i,j\} \in E} w_{ij}$ is obtained. The mutual information of two random variables $X^{(i)}, X^{(j)}$ is used as $w_{ij}$ in the Chow-Liu algorithm.

For instance, suppose the values of the mutual information $I(i,j)$ of the pairs $X^{(i)}, X^{(j)}$ ($i \neq j$) are given in Table 1. Then, we proceed as follows:

1. connect $X^{(1)}, X^{(2)}$ first because $I(1,2)$ is the largest;

2. connect $X^{(1)}, X^{(3)}$ because $I(1,3)$ is the largest among the unselected;

3. do not connect $X^{(2)}, X^{(3)}$: $I(2,3)$ is the largest among the unselected, but connecting $X^{(2)}, X^{(3)}$ would make a loop;

4. connect $X^{(1)}, X^{(4)}$ because $I(1,4)$ is the largest among the unselected;

5. terminate the process because adding any of the remaining candidates would make a loop.

Table 1: Mutual Information for 

If the distribution is not given but samples are given, the task is estimation rather than approximation. Then, the Chow-Liu algorithm uses the maximum likelihood estimators of the mutual information rather than the true mutual information values. In that case, we would only choose a tree with high fitness, without considering the complexity of the tree or the number of parameters: an (unconnected) forest rather than a (spanning) tree might have been closer to the true distribution. The order of selecting pairs of nodes may be different if we take into account the simplicity of the forest/tree structures.

In 1993, Suzuki [3] proposed a modified version of the Chow-Liu algorithm based on the Minimum Description Length (MDL), in which the mutual information is replaced by the mutual information minus a penalty value defined for each pair of random variables, in order to take the simplicity of the forest into account. The modified algorithm obtains the best forest in the sense of MDL.

However, these results assume that the random variables take finitely many values. This paper deals with the general case: the Chow-Liu and Suzuki algorithms for general random variables.

In Section 2, we restate the Chow-Liu and Suzuki algorithms to capture their essentials. Section 3 deals with the generalizations. For the Suzuki algorithm, we consider two cases:

1. only Gaussian random variables are present;

2. both Gaussian and finite random variables are present.

In Section 4, we summarize the results of this paper and discuss future work.

2 For finite random variables 
2.1 Definitions 

Let $V$ be a finite set and $E$ a subset of $\mathcal{E} := \{\{u,v\} \mid u,v \in V,\ u \neq v\}$. The pair $(V,E)$ is called an undirected graph. For an undirected graph $G = (V,E)$, $V$ and its elements are called the vertex set and the vertices of $G$, respectively; and $E$ and its elements are called the edge set and the edges of $G$, respectively. The sequence $\{v_i\}_{i=0}^{k}$ ($k = 0,1,\dots$) is called a path connecting $v_0, v_k \in V$ if $v_1,\dots,v_{k-1} \in V$ satisfy $\{v_{i-1},v_i\} \in E$, $i = 1,\dots,k$. In particular, if $v_0 = v_k$, the path $\{v_i\}_{i=0}^{k}$ is called a loop. The undirected graph $G$ is called a forest if $G$ does not contain any loop, and is said to be connected if there exists a path connecting each pair of vertices in $G$. Any connected forest is called a tree.

On the other hand, a pair of a finite set $V$ and a subset $E$ of $\{(u,v) \mid u,v \in V,\ u \neq v\}$ is called a directed graph. In directed graphs, we distinguish $(u,v), (v,u) \in E$.

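For each $i,j = 1,\dots,N$ ($i \neq j$), let $X^{(i)}$ be random variables that take finitely many values in $X^{(i)}(\Omega)$, $P_i(x)$ the probability of $X^{(i)} = x \in X^{(i)}(\Omega)$, $P_{i,j}(x,y)$ the probability of $X^{(i)} = x \in X^{(i)}(\Omega)$ and $X^{(j)} = y \in X^{(j)}(\Omega)$, and $P_{i|j}(x|y)$ the conditional probability of $X^{(i)} = x \in X^{(i)}(\Omega)$ given $X^{(j)} = y \in X^{(j)}(\Omega)$ ($P_i(x)$, $x \in X^{(i)}(\Omega)$, if $j = 0$). We define the mutual information between $X^{(i)}$ and $X^{(j)}$ by

$$I(i,j) := \sum_{x \in X^{(i)}(\Omega),\, y \in X^{(j)}(\Omega)} P_{i,j}(x,y) \log \frac{P_{i,j}(x,y)}{P_i(x) P_j(y)} .$$

We assume a natural bijection between the vertices in $V = \{1,\dots,N\}$ and the $N$ random variables $X^{(1)},\dots,X^{(N)}$.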

2.2 The original Chow-Liu algorithm 

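We consider approximating the probability $P_{1,\dots,N}(x^{(1)},\dots,x^{(N)})$ of $X^{(1)} = x^{(1)} \in X^{(1)}(\Omega), \dots, X^{(N)} = x^{(N)} \in X^{(N)}(\Omega)$ by

$$Q_{1,\dots,N}(x^{(1)},\dots,x^{(N)}) := \prod_{i=1}^{N} P_{i|\pi(i)}(x^{(i)} \mid x^{(\pi(i))}) \qquad (1)$$

(the Dendroid distribution), where $\pi : \{1,\dots,N\} \to \{0,1,\dots,N\}$ satisfies $\pi^{k}(i) \neq i$, $i = 1,\dots,N$, $k = 1,2,\dots$, if we define

$$\pi^{0}(i) := i, \qquad \pi^{k}(i) := \pi(\pi^{k-1}(i)), \quad k = 1,2,\dots .$$

Although the Dendroid distribution (1) is in general expressed by a directed graph with emitting vertices $j \in \{1,\dots,N\}$ such that $\pi(j) = 0$, it can be regarded as an undirected graph $(V,E)$ such that $V := \{1,\dots,N\}$ and $E := \{\{i,\pi(i)\} \mid \pi(i) \neq 0,\ i \in V\}$. Since

$$Q_{1,\dots,N}(x^{(1)},\dots,x^{(N)}) = 0 \implies P_{1,\dots,N}(x^{(1)},\dots,x^{(N)}) = 0$$

is true, we can define the Kullback-Leibler information from $P_{1,\dots,N}$ to $Q_{1,\dots,N}$ by

$$D(P_{1,\dots,N} \| Q_{1,\dots,N}) := \sum_{x^{(1)} \in X^{(1)}(\Omega),\dots,x^{(N)} \in X^{(N)}(\Omega)} P_{1,\dots,N}(x^{(1)},\dots,x^{(N)}) \log \frac{P_{1,\dots,N}(x^{(1)},\dots,x^{(N)})}{Q_{1,\dots,N}(x^{(1)},\dots,x^{(N)})} .$$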

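We wish to identify $Q_{1,\dots,N}$ so that the value of $D(P_{1,\dots,N} \| Q_{1,\dots,N})$ is minimized. In other words, we evaluate the error by $D(P_{1,\dots,N} \| Q_{1,\dots,N})$ when we approximate $P_{1,\dots,N}$ by $Q_{1,\dots,N}$, and find the $\pi$ minimizing it. On the other hand, since

$$Q_{1,\dots,N}(x^{(1)},\dots,x^{(N)}) = \prod_{i=1}^{N} P_i(x^{(i)}) \prod_{\pi(i) \neq 0} \frac{P_{i,\pi(i)}(x^{(i)}, x^{(\pi(i))})}{P_i(x^{(i)}) P_{\pi(i)}(x^{(\pi(i))})} , \qquad (2)$$

we have

$$D(P_{1,\dots,N} \| Q_{1,\dots,N}) = - \sum_{\pi(i) \neq 0} I(i,\pi(i)) + \sum_{x^{(1)} \in X^{(1)}(\Omega),\dots,x^{(N)} \in X^{(N)}(\Omega)} P_{1,\dots,N}(x^{(1)},\dots,x^{(N)}) \log \frac{P_{1,\dots,N}(x^{(1)},\dots,x^{(N)})}{\prod_{i=1}^{N} P_i(x^{(i)})} , \qquad (3)$$

and find that the last term in (3) does not depend on $\pi$. Hence, minimizing $D(P_{1,\dots,N} \| Q_{1,\dots,N})$ is equivalent to maximizing $\sum_{\{i,j\} \in E} I(i,j)$. In this case, the (undirected) forest has only one $i \in V$ such that $\pi(i) = 0$ (undirected tree).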

To this end, we apply the Kruskal algorithm, which maximizes the total weight of the obtained tree when the weights of all pairs of vertices are given beforehand. In this case, the weight of each edge is the mutual information $I(i,j)$.

Algorithm 1 (Chow-Liu, 1968) [2]

Input $\{I(i,j)\}_{i \neq j}$

Output $E$

1. $E := \{\}$;

2. $\mathcal{E} := \{\{i,j\} \mid i \neq j\}$;

3. $\mathcal{E} := \mathcal{E} \setminus \{\{i,j\}\}$ for $\{i,j\} \in \mathcal{E}$ maximizing $I(i,j)$;

4. if $(V, E \cup \{\{i,j\}\})$ does not contain a loop, then $E := E \cup \{\{i,j\}\}$;

5. if $\mathcal{E} \neq \{\}$, then go to 3, else terminate.

($\cup$ and $\setminus$ denote the union and difference of two sets.)

The Kruskal algorithm outputs a tree with the maximum total weight (Aho, Hopcroft, Ullman, 1974 [1]).
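
The steps of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function name `chow_liu_tree`, the union-find loop test, and the weight values are all assumptions introduced here. The weights are hypothetical numbers chosen only so that their ordering matches the example in the introduction.

```python
from itertools import combinations


def chow_liu_tree(n_vertices, weight):
    """Maximum-weight spanning tree via Kruskal's algorithm.

    `weight` maps frozenset({i, j}) to the mutual information I(i, j)
    for vertices labelled 1..n_vertices (hypothetical interface).
    """
    parent = {v: v for v in range(1, n_vertices + 1)}

    def find(v):
        # Follow parent links to the component representative (path halving).
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    # Steps 2-3: visit every pair in decreasing order of weight.
    candidates = sorted(
        (frozenset(pair) for pair in combinations(parent, 2)),
        key=lambda e: weight[e],
        reverse=True,
    )
    edges = []  # step 1: E := {}
    for e in candidates:
        i, j = sorted(e)
        root_i, root_j = find(i), find(j)
        if root_i != root_j:  # step 4: adding {i, j} creates no loop
            parent[root_i] = root_j
            edges.append((i, j))
    return edges


# Hypothetical weights with I(1,2) > I(1,3) > I(2,3) > I(1,4) > ...
w = {frozenset(k): v for k, v in {
    (1, 2): 12, (1, 3): 10, (2, 3): 8, (1, 4): 6, (2, 4): 4, (3, 4): 2,
}.items()}
print(chow_liu_tree(4, w))  # [(1, 2), (1, 3), (1, 4)]
```

The union-find structure replaces an explicit loop search: two vertices already in the same component would form a loop if joined.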

2.3 Maximizing Likelihood 

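If distributions such as $P_{1,\dots,N}$, $Q_{1,\dots,N}$ are not given, we need to estimate the parameters $\theta$ expressing $P(x^{(1)},\dots,x^{(N)} \mid \theta)$ and $Q(x^{(1)},\dots,x^{(N)} \mid \theta)$. In this case, if we differentiate the negative log-likelihood with respect to each component of $\theta$ to obtain the maximum likelihood estimators $\hat\theta$, we find that they are relative frequencies:

$$P(x^{(1)},\dots,x^{(N)} \mid \hat\theta(x^n)) = \frac{c_{1,\dots,N}(x^{(1)},\dots,x^{(N)})}{n} ,$$

where $c_{1,\dots,N}(x^{(1)},\dots,x^{(N)})$ is the number of occurrences of $(x^{(1)},\dots,x^{(N)})$ in the training sequence.

Given $n$ training sequences

$$x^n = \{(x_h^{(1)},\dots,x_h^{(N)})\}_{h=1}^{n} \in (X^{(1)}(\Omega) \times \cdots \times X^{(N)}(\Omega))^n ,$$

let $c_i(x)$, $c_j(y)$, and $c_{i,j}(x,y)$ be the numbers of occurrences of $X^{(i)} = x \in X^{(i)}(\Omega)$, $X^{(j)} = y \in X^{(j)}(\Omega)$, and $(X^{(i)}, X^{(j)}) = (x,y) \in X^{(i)}(\Omega) \times X^{(j)}(\Omega)$, respectively. Then, minimizing

$$D(P(\cdot \mid \hat\theta(x^n)) \| Q(\cdot \mid \hat\theta(x^n))) = \sum P(x^{(1)},\dots,x^{(N)} \mid \hat\theta(x^n)) \log \frac{P(x^{(1)},\dots,x^{(N)} \mid \hat\theta(x^n))}{Q(x^{(1)},\dots,x^{(N)} \mid \hat\theta(x^n))}$$

is equivalent to minimizing

$$H(\pi, x^n) := \sum_{h=1}^{n} -\log Q(x_h^{(1)},\dots,x_h^{(N)} \mid \hat\theta(x^n)) = -n \sum_{\{i,j\} \in E} I_n(i,j) + n \sum_{i=1}^{N} \sum_{x \in X^{(i)}(\Omega)} -P_i(x \mid \hat\theta(x^n)) \log P_i(x \mid \hat\theta(x^n)) , \qquad (4)$$

and $I(i,j)$ in Algorithm 1 is replaced by

$$I_n(i,j) := \frac{1}{n} \sum_{x \in X^{(i)}(\Omega),\, y \in X^{(j)}(\Omega)} c_{i,j}(x,y) \log \frac{n\, c_{i,j}(x,y)}{c_i(x) c_j(y)}$$

to obtain the structure $\pi$ for the Dendroid distribution.

More accurate learning results could be obtained without restricting the approximation to the Dendroid distribution, say by allowing each variable to depend on more than one parent. However, computation exponential in $N$ is then required in general. The Chow-Liu algorithm and its variants complete in $O(N^2)$ time, and are easier to apply to realistic problems.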
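
The plug-in estimate of the mutual information from occurrence counts can be sketched as below. The function name `empirical_mutual_information` is hypothetical, and the sketch assumes the per-sample normalization (division by $n$) and natural logarithms; if a different normalization is intended, only a constant factor changes and the ordering of the pairs is unaffected.

```python
from collections import Counter
from math import log


def empirical_mutual_information(xs, ys):
    """Plug-in estimate of I(X; Y) from paired samples of finite variables.

    Uses the counts c_i(x), c_j(y), c_ij(x, y); assumes normalization by n.
    """
    n = len(xs)
    c_x, c_y, c_xy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    # Pairs with zero count contribute nothing, so only observed pairs are summed.
    return sum(
        (c / n) * log(n * c / (c_x[x] * c_y[y]))
        for (x, y), c in c_xy.items()
    )


# Perfectly dependent binary samples versus empirically independent samples.
dependent = empirical_mutual_information([0, 0, 1, 1], [0, 0, 1, 1])
independent = empirical_mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
```

Evaluating this for every pair $\{i,j\}$ yields the weights that Algorithm 1 consumes when only samples are available.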

2.4 Minimizing description length 

Another way to deal with the case in which the distributions $P_{1,\dots,N}$, $Q_{1,\dots,N}$ are not given is to mix $P(x^{(1)},\dots,x^{(N)} \mid \theta)$ and $Q(x^{(1)},\dots,x^{(N)} \mid \theta)$ with a weight $w$ w.r.t. $\theta$ such that $\int w(\theta)\,d\theta = 1$:
and 

We consider finding the structure $\pi$ that maximizes

n 
i=l 

or, equivalently, minimizing $\sum_{i=1}^{n} -\log Q(x_i^{(1)},\dots,x_i^{(N)})$ rather than minimizing $H(\pi, x^n)$. The quantity is called the description length because it satisfies the Kraft inequality in information theory.

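Let $\alpha^{(i)}$ be the number of elements in $X^{(i)}(\Omega)$, $i = 1,\dots,N$, and $\alpha^{(0)} := 1$. We notice that $Q_{1,\dots,N}$ has $k := \sum_{i=1}^{N} \alpha^{(\pi(i))}(\alpha^{(i)} - 1)$ parameters: for each $X^{(\pi(i))} = x^{(\pi(i))} \in X^{(\pi(i))}(\Omega)$, the probabilities of $X^{(i)} = x^{(i)} \in X^{(i)}(\Omega)$ should be specified. Then, there exists a constant $C$ such that [3]

$$L(\pi, x^n) := H(\pi, x^n) + \frac{k}{2}\log n + C \ge \sum_{i=1}^{n} -\log Q(x_i^{(1)},\dots,x_i^{(N)}) , \qquad (5)$$

and the left-hand side also satisfies the Kraft inequality for each $\pi$.

The number of parameters increases from $\alpha^{(i)} - 1$ to $\alpha^{(\pi(i))}(\alpha^{(i)} - 1)$ if we connect $i$ and $\pi(i)$ as an edge, so that, from (4), the description length (5) becomes

$$L(\pi, x^n) = -n \sum_{\{i,j\} \in E} I_n(i,j) + \frac{1}{2} \sum_{\{i,j\} \in E} (\alpha^{(i)} - 1)(\alpha^{(j)} - 1) \log n + C' = -n \sum_{\{i,j\} \in E} J_n(i,j) + C' ,$$

where $C'$ is a constant that does not depend on the structure $\pi$. Thus, we only need to maximize $\sum_{\{i,j\} \in E} J_n(i,j)$ with

$$J_n(i,j) := I_n(i,j) - \frac{1}{2n}(\alpha^{(i)} - 1)(\alpha^{(j)} - 1)\log n . \qquad (6)$$

This time, we apply the Kruskal algorithm with $\{J_n(i,j)\}_{i \neq j}$ rather than with $\{I_n(i,j)\}_{i \neq j}$.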

Algorithm 2 (Suzuki, 1993) [3]

Input $V$, $\{J_n(i,j)\}_{i \neq j}$

Output $E$

1. $E := \{\}$;

2. $\mathcal{E} := \{\{i,j\} \mid i,j \in V,\ i \neq j\}$;

3. $\mathcal{E} := \mathcal{E} \setminus \{\{i,j\}\}$ for $\{i,j\} \in \mathcal{E}$ maximizing $J_n(i,j)$;

4. if $J_n(i,j) > 0$ and $(V, E \cup \{\{i,j\}\})$ does not contain a loop, then $E := E \cup \{\{i,j\}\}$;

5. if $\mathcal{E} \neq \{\}$, then go to 3, else terminate.

Example 1 Suppose that the values of $J_n(i,j)$ are given in Table 2, and that $\alpha^{(1)} = 5$, $\alpha^{(2)} = 2$, $\alpha^{(3)} = 3$, and $\alpha^{(4)} = 4$.

1. Connect $X^{(1)}, X^{(2)}$ because $J_n(1,2) = 8$ is the largest.

2. Connect $X^{(2)}, X^{(3)}$ because $J_n(2,3) = 6$ is the largest among the unselected.

3. Do not connect $X^{(1)}, X^{(3)}$: $J_n(1,3) = 2$ is the largest among the unselected, but connecting them would make a loop.

4. Connect $X^{(2)}, X^{(4)}$ because $J_n(2,4) = 1$ is the largest among the unselected.

5. Terminate the process because, for the remaining candidates, $J_n(i,j) \le 0$ or adding any of them would make a loop.
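
Example 1 can be traced with a short Python sketch of Algorithm 2. The function name `suzuki_forest` is an assumption, and because the entries of Table 2 for the pairs (1,4) and (3,4) are not stated in the text, the negative values used for them below are hypothetical placeholders.

```python
def suzuki_forest(vertices, score):
    """Kruskal-style forest that accepts an edge only if its score is positive.

    `score` maps a pair (i, j) to J_n(i, j) (hypothetical interface).
    """
    parent = {v: v for v in vertices}

    def find(v):
        # Follow parent links to the component representative (path halving).
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    edges = []
    for (i, j), s in sorted(score.items(), key=lambda kv: kv[1], reverse=True):
        if s <= 0:
            break  # every remaining candidate also has a nonpositive score
        root_i, root_j = find(i), find(j)
        if root_i != root_j:  # adding {i, j} creates no loop
            parent[root_i] = root_j
            edges.append((i, j))
    return edges


# J_n values from Example 1; the two negative entries are placeholders.
J = {(1, 2): 8, (2, 3): 6, (1, 3): 2, (2, 4): 1, (1, 4): -1, (3, 4): -2}
print(suzuki_forest([1, 2, 3, 4], J))  # [(1, 2), (2, 3), (2, 4)]
```

The only difference from the Chow-Liu sketch is the early stop on nonpositive scores, which is what allows the output to be an unconnected forest.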



Table 2: Example 2 

Both $I_n(i,j)$ and $J_n(i,j)$ are criteria for choosing the edge $\{i,j\}$. We notice that $I_n(i,j)$ only sees whether the training sequence $x^n$ fits the structure $\pi$. On the other hand, $J_n(i,j)$ looks at the simplicity of the forest as well as the fit, so that even if $\mathcal{E} \neq \{\}$, the process stops if $J_n(i,j) \le 0$ for all the remaining $\{i,j\}$. The resulting forest can be either connected or unconnected. Since the selection order differs between $\{I_n(i,j)\}_{i \neq j}$ and $\{J_n(i,j)\}_{i \neq j}$, the structures of the resulting forests may differ when both algorithms complete.

Furthermore, $\frac{k}{2}\log n$ in (5) can be replaced by $\frac{k}{2} d_n$ with a nonnegative real sequence $\{d_n\}_{n=1}^{\infty}$ such that $\lim_{n \to \infty} d_n / n = 0$ for general information criteria.



3 For general random variables

Consider general random variables:

Example 2 Suppose that a random variable $X$ has a distribution function $F_X$ given piecewise on the ranges $x < -1$, $-1 \le x < 0$, and the remaining values of $x$, in terms of a function $g$ with $\int g(t)\,dt = 1$. Such an $X$ does not have any probability density function $f_X$ such that $F_X(x) = \int_{-\infty}^{x} f_X(t)\,dt$, which means that $X$ is neither discrete nor continuous.

In this section, we show how the Chow-Liu algorithm and its variants can be extended to such general random variables.

3.1 Definitions


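We fix a probability space $(\Omega, \mathcal{F}, \mu)$, where $\Omega$ is a sample space and $\mathcal{F}$ is a $\sigma$-field on $\Omega$, i.e. a set consisting of the sets obtained by applying a countable number of the set operations $\cup$, $\setminus$, $\cap$ to subsets of $\Omega$. The elements of $\mathcal{F}$ are called events. We denote by $\mathcal{B}$ the $\sigma$-field generated by all open sets in $\mathbb{R}$ (the Borel field of $\mathbb{R}$). In general, if the mapping $f : \Omega \to \mathbb{R}$ satisfies

$$D \in \mathcal{B} \implies \{\omega \in \Omega \mid f(\omega) \in D\} \in \mathcal{F} ,$$

$f$ is said to be measurable on $\mathcal{F}$. A mapping $\nu : \mathcal{F} \to \mathbb{R}$ satisfying

1. $\nu(A) \ge 0$, $A \in \mathcal{F}$

2. $A \cap B = \{\} \implies \nu(A \cup B) = \nu(A) + \nu(B)$

3. $\nu(\{\}) = 0$

is said to be a measure. The $\mu$ in the probability space is a measure such that $\mu(\Omega) = 1$ (probability measure).

We can define the Lebesgue integral

$$\int_A f\,d\nu := \sup_{\{A_i\}} \sum_i \{\inf_{\omega \in A_i} f(\omega)\,\nu(A_i)\} = \inf_{\{A_i\}} \sum_i \{\sup_{\omega \in A_i} f(\omega)\,\nu(A_i)\}$$

w.r.t. a measure $\nu : \mathcal{F} \to \mathbb{R}$ and a measurable bounded $f$ on $\mathcal{F}$, where $A = \cup_i A_i$, $A_i \cap A_j = \{\}$ ($i \neq j$).

For measures $\mu, \nu$ on $\mathcal{F}$ and $A \in \mathcal{F}$, if $\nu(A) = 0 \implies \mu(A) = 0$, then $\mu$ is said to be absolutely continuous w.r.t. $\nu$, and we write $\mu \ll \nu$. Also, we say that a measure $\nu$ is $\sigma$-finite if $\Omega = \cup_i A_i$ and $\nu(A_i) < \infty$.

Proposition 1 (Radon-Nikodym) If $\mu, \nu$ are $\sigma$-finite and $\mu \ll \nu$, then there exists a measurable $\frac{d\mu}{d\nu} := f \ge 0$ on $\mathcal{F}$ such that, for each $A \in \mathcal{F}$,

$$\mu(A) = \int_A f\,d\nu .$$

Corollary 1 If $\mu \ll \nu \ll \lambda$,

$$\frac{d\mu}{d\lambda} = \frac{d\mu}{d\nu} \frac{d\nu}{d\lambda} .$$

When $\mu \ll \nu$, we define the Kullback-Leibler information

$$D(\mu \| \nu) := \int d\mu \log\left(\frac{d\mu}{d\nu}\right) .$$

Properties such as $D(\mu \| \nu) \ge 0$ and $D(\mu \| \nu) = 0 \iff \mu = \nu$ are available.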
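
3.2 Generalization

In $(\Omega, \mathcal{F}, \mu)$, any measurable mapping $X : \Omega \to \mathbb{R}$ on $\mathcal{F}$ is called a random variable. For $D, D' \in \mathcal{B}$, let

$$\mu_X(D) := \mu(\{\omega \in \Omega \mid X(\omega) \in D\}) ,$$

$$\mu_Y(D) := \mu(\{\omega \in \Omega \mid Y(\omega) \in D\}) ,$$

$$\mu_{XY}(D,D') := \mu(\{\omega \in \Omega \mid X(\omega) \in D,\ Y(\omega) \in D'\}) ,$$

$$\mu_{X|Y}(D \mid D') := \frac{\mu_{XY}(D,D')}{\mu_Y(D')} \quad \text{for } \mu_Y(D') > 0 ,$$

and

$$\nu_{XY}(D,D') := \mu_X(D)\,\mu_Y(D') .$$

Then, we have

$$\nu_{XY}(D,D') = 0 \implies \mu_{XY}(D,D') = 0 ,$$

which means that $\mu_{XY}$ is absolutely continuous w.r.t. $\nu_{XY}$. We define the mutual information between $X, Y$ by

$$I(X,Y) := D(\mu_{XY} \| \nu_{XY}) = \int_{x \in X(\Omega),\, y \in Y(\Omega)} \mu_{XY}(dx,dy) \log \frac{d\mu_{XY}}{d\nu_{XY}}(x,y) .$$

Hereafter, we denote $\frac{d\mu_{XY}}{d\nu_{XY}}$ in the definition by $\frac{d\mu_{XY}}{d\mu_X d\mu_Y}$.

For random variables $X^{(1)},\dots,X^{(N)}$, we define $\mu_i(D) := \mu_{X^{(i)}}(D)$, $\mu_{i,j}(D,D') := \mu_{X^{(i)}X^{(j)}}(D,D')$, $I(i,j) := I(X^{(i)},X^{(j)})$, and $\mu_{i|j}(D \mid D') := \mu_{X^{(i)}|X^{(j)}}(D \mid D')$ for $i,j = 1,\dots,N$ ($i \neq j$), and $\mu_{i|j}(D \mid D') := \mu_i(D)$ if $j = 0$.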
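For $D^{(1)},\dots,D^{(N)} \in \mathcal{B}$, we approximate

$$\mu_{1,\dots,N}(D^{(1)},\dots,D^{(N)}) := \mu(\{\omega \in \Omega \mid X^{(1)}(\omega) \in D^{(1)},\dots,X^{(N)}(\omega) \in D^{(N)}\})$$

by

$$\nu_{1,\dots,N}(D^{(1)},\dots,D^{(N)}) := \prod_{i=1}^{N} \mu_{i|\pi(i)}(D^{(i)} \mid D^{(\pi(i))}) . \qquad (7)$$

From

$$\nu_{1,\dots,N}(D^{(1)},\dots,D^{(N)}) = 0 \implies \mu_{1,\dots,N}(D^{(1)},\dots,D^{(N)}) = 0 ,$$

the Kullback-Leibler information of $\mu_{1,\dots,N}$ w.r.t. $\nu_{1,\dots,N}$ is defined:

$$D(\mu_{1,\dots,N} \| \nu_{1,\dots,N}) := \int_{x^{(1)} \in X^{(1)}(\Omega),\dots,x^{(N)} \in X^{(N)}(\Omega)} \mu_{1,\dots,N}(dx^{(1)},\dots,dx^{(N)}) \log \frac{d\mu_{1,\dots,N}}{d\nu_{1,\dots,N}}(x^{(1)},\dots,x^{(N)}) .$$

We wish to find $\nu_{1,\dots,N}$ such that $D(\mu_{1,\dots,N} \| \nu_{1,\dots,N})$ is minimized.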
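
Theorem 1 There exists a constant $C$ not depending on $\pi$ such that

$$D(\mu_{1,\dots,N} \| \nu_{1,\dots,N}) = - \sum_{\pi(i) \neq 0} I(i,\pi(i)) + C .$$

Proof: Generalizing from (7), we have

$$\nu_{1,\dots,N}(D^{(1)},\dots,D^{(N)}) = \prod_{i=1}^{N} \mu_i(D^{(i)}) \prod_{\pi(i) \neq 0} \frac{\mu_{i,\pi(i)}(D^{(i)},D^{(\pi(i))})}{\mu_i(D^{(i)})\,\mu_{\pi(i)}(D^{(\pi(i))})} . \qquad (8)$$

Let

$$\eta(D^{(1)},\dots,D^{(N)}) := \prod_{i=1}^{N} \mu_i(D^{(i)}) .$$

Then, we have

$$\frac{d\nu_{1,\dots,N}}{d\eta}(x^{(1)},\dots,x^{(N)}) = \prod_{\pi(i) \neq 0} \frac{d\mu_{i,\pi(i)}}{d\mu_i d\mu_{\pi(i)}}(x^{(i)},x^{(\pi(i))}) \qquad (9)$$

(see Appendix for proof). From the corollary, we have

$$\frac{d\mu_{1,\dots,N}}{d\nu_{1,\dots,N}} = \frac{d\mu_{1,\dots,N}}{d\eta} \left(\frac{d\nu_{1,\dots,N}}{d\eta}\right)^{-1} = \left\{\prod_{\pi(i) \neq 0} \frac{d\mu_{i,\pi(i)}}{d\mu_i d\mu_{\pi(i)}}\right\}^{-1} \frac{d\mu_{1,\dots,N}}{d\eta} .$$

Furthermore, taking $E \log$ of both sides, we have

$$E \log \frac{d\mu_{1,\dots,N}}{d\nu_{1,\dots,N}}(X^{(1)},\dots,X^{(N)}) = - \sum_{\pi(i) \neq 0} E \log \frac{d\mu_{i,\pi(i)}}{d\mu_i d\mu_{\pi(i)}}(X^{(i)},X^{(\pi(i))}) + E \log \frac{d\mu_{1,\dots,N}}{d\eta}(X^{(1)},\dots,X^{(N)}) .$$

This completes the proof.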

3.3 When only Gaussian random variables are present 

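We express the probability density functions of $X^{(i)} \sim \mathcal{N}(\mu^{(i)}, \sigma_{ii})$ and $(X^{(i)},X^{(j)}) \sim \mathcal{N}((\mu^{(i)},\mu^{(j)}), \Sigma)$ by

$$f_i(x) := \frac{1}{\sqrt{2\pi\sigma_{ii}}} \exp\left\{-\frac{(x - \mu^{(i)})^2}{2\sigma_{ii}}\right\}$$

and

$$f_{i,j}(x,y) := \frac{1}{2\pi|\Sigma|^{1/2}} \exp\left\{-\frac{1}{2}(x - \mu^{(i)}, y - \mu^{(j)})\,\Sigma^{-1}\,(x - \mu^{(i)}, y - \mu^{(j)})^{T}\right\} ,$$

respectively, where $\Sigma = \begin{pmatrix} \sigma_{ii} & \sigma_{ij} \\ \sigma_{ij} & \sigma_{jj} \end{pmatrix}$. Let $\rho_{ij} := \frac{\sigma_{ij}}{\sqrt{\sigma_{ii}\sigma_{jj}}}$ be the correlation coefficient. Then, $I(i,j)$ can be obtained via $\rho_{ij}$:

$$I(i,j) = -\frac{1}{2}\log(1 - \rho_{ij}^2) .$$

The Chow-Liu algorithm can be applied using those values.

As in Section 2.3, the maximum likelihood estimators of $I(i,j)$,

$$I_n(i,j) = -\frac{1}{2}\log(1 - \hat\rho_{ij}^2) ,$$

can be obtained from the training sequence of length $n$: $x^n = \{(x_h^{(1)},\dots,x_h^{(N)})\}_{h=1}^{n} \in (X^{(1)}(\Omega) \times \cdots \times X^{(N)}(\Omega))^n$.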

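Let $\lambda_{ij} \in \mathbb{R}$, $X^{(i)} = \epsilon_i \sim \mathcal{N}(0,\phi_i)$, $X^{(j)} = \lambda_{ij}X^{(i)} + \epsilon_j$, $\epsilon_j \sim \mathcal{N}(0,\phi_j)$.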
Then, we have 

1 

Pi,3 



and 



Thus, any two of $\rho_{ij}, \lambda_{ij}, \lambda_{ji}$ are in bijection. Under the condition

$$\rho_{ij} = 0 \iff \lambda_{ij} = 0 \iff \sigma_{ii} = \phi_i,\ \sigma_{jj} = \phi_j,\ \sigma_{ij} = 0 ,$$

there are two independent parameters $\sigma_{ii} = \phi_i$, $\sigma_{jj} = \phi_j$; if $\lambda_{ij} \neq 0$, another parameter, $\lambda_{ij}$ (equivalently, the covariance $\sigma_{ij} = \lambda_{ij}\phi_i$), should be specified.

Thus, if we consider the complexity of forests, adding one edge leads to adding one parameter, so that

$$J_n(i,j) = I_n(i,j) - \frac{1}{2n} d_n .$$

It is possible that the process terminates before the forest becomes a tree, if all the remaining values of $J_n(i,j)$ are negative at that point. However, because the penalty is the same for every pair, the order of selecting the edges is the same for $\{I_n(i,j)\}_{i \neq j}$ and $\{J_n(i,j)\}_{i \neq j}$.
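
For the Gaussian case, the estimate $I_n(i,j) = -\frac{1}{2}\log(1 - \hat\rho_{ij}^2)$ and the corresponding score can be sketched as follows. The function names are hypothetical, $\hat\rho_{ij}$ is taken to be the sample correlation coefficient, and the defaults $d_n = \log n$ and a penalty of $d_n/(2n)$ per edge are assumptions of this sketch.

```python
from math import log


def gaussian_mutual_information(xs, ys):
    """Estimate -1/2 * log(1 - rho^2) using the sample correlation rho.

    Undefined (raises an error) when the samples are perfectly correlated
    or when either variable is constant.
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    rho_sq = s_xy * s_xy / (s_xx * s_yy)
    return -0.5 * log(1.0 - rho_sq)


def gaussian_score(xs, ys, d_n=None):
    """Penalized score: one extra parameter per edge (assumed d_n = log n)."""
    n = len(xs)
    if d_n is None:
        d_n = log(n)
    return gaussian_mutual_information(xs, ys) - d_n / (2 * n)


xs, ys = [1, 2, 3, 4], [1, 3, 2, 4]
mi = gaussian_mutual_information(xs, ys)  # sample correlation is 0.8 here
score = gaussian_score(xs, ys)
```

Since the penalty does not depend on the pair, sorting by `gaussian_score` gives the same order as sorting by `gaussian_mutual_information`; only the stopping point changes.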

3.4 When both Gaussian and finite random variables are present

We consider the case that both Gaussian and finite random variables are present. Suppose that $X^{(i)}$ and $X^{(j)}$ are Gaussian and finite, respectively. Then, the mutual information is



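where $P_j(y) := \mu_j(\{y\})$, $y \in X^{(j)}(\Omega)$, and $f_{i|j}(x|y)$ is the conditional Gaussian density given $X^{(j)} = y$. Thus, $X^{(i)}$ has as many Gaussian distributions as the values $X^{(j)}$ takes. In particular, if for unknown $g : X^{(j)}(\Omega) \to \mathbb{R}$ and $\epsilon_i \sim \mathcal{N}(0,\phi_i)$

$$f_{i|j}(x|y) = \frac{1}{\sqrt{2\pi\phi_i}} \exp\left\{-\frac{(x - g(y))^2}{2\phi_i}\right\} , \qquad (10)$$

then the $|X^{(j)}(\Omega)| = \alpha^{(j)}$ parameters $g(y)$, $y \in X^{(j)}(\Omega)$, should be estimated. The estimated mutual information becomes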



V£X''i) 



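where $\hat f_{i|j}(\cdot|y)$ is the estimated probability density function in which $g(y)$ in (10) is replaced by the maximum likelihood estimator $\hat g(y)$: solve $\partial L / \partial P_j(y) = 0$, $\partial L / \partial g(y) = 0$, $y \in X^{(j)}(\Omega)$, for

$$L = \log \prod_{h=1}^{n} \{f_{i|j}(x_h^{(i)} \mid x_h^{(j)})\,P_j(x_h^{(j)})\} + \lambda\{1 - \sum_{y \in X^{(j)}(\Omega)} P_j(y)\}$$

to obtain

$$\hat P_j(y) = \frac{1}{n}\sum_{h=1}^{n} I[x_h^{(j)} = y] , \qquad \hat g(y) = \frac{\sum_{h=1}^{n} x_h^{(i)}\, I[x_h^{(j)} = y]}{\sum_{h=1}^{n} I[x_h^{(j)} = y]} ,$$

where $I[x_h^{(j)} = y] = 1$ if $x_h^{(j)} = y$, and $0$ otherwise.

However, if $X^{(i)}$ and $X^{(j)}$ are independent, then $g$ is a constant and $g(y) = \mu^{(i)}$ for all $y \in X^{(j)}(\Omega)$. Thus,

$$\hat\mu^{(i)} = \frac{1}{n}\sum_{h=1}^{n} x_h^{(i)} .$$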
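
The stationarity conditions for $L$ are solved by relative frequencies for $P_j(y)$ and by the sample mean of $x^{(i)}$ within each value $y$ for $g(y)$. A minimal sketch under that reading follows; the function name `fit_conditional_means` and the toy data are hypothetical.

```python
from collections import defaultdict


def fit_conditional_means(xs, ys):
    """Maximum likelihood estimates for a Gaussian X given a finite Y.

    Returns (p_hat, g_hat): relative frequencies of each value y, and the
    mean of the x samples observed together with that y.
    """
    n = len(xs)
    total = defaultdict(float)
    count = defaultdict(int)
    for x, y in zip(xs, ys):
        total[y] += x
        count[y] += 1
    p_hat = {y: c / n for y, c in count.items()}
    g_hat = {y: total[y] / count[y] for y in count}
    return p_hat, g_hat


# Toy data: two groups with clearly different means.
p_hat, g_hat = fit_conditional_means([1.0, 3.0, 10.0, 14.0], ["a", "a", "b", "b"])
```

When the group means in `g_hat` are all close to the overall sample mean, the data give little evidence against independence, which is the case the penalty term is meant to detect.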

If $\{i,j\}$ is not connected as an edge, the parameters w.r.t. $X^{(i)}$ are only $\mu^{(i)}$ and $\sigma_{ii} = \phi_i$. However, if it is connected, we need to estimate $g(y)$, $y \in X^{(j)}(\Omega)$, and $\phi_i$, so that the number of additional parameters is $\alpha^{(j)} - 1$:

$$J_n(i,j) := I_n(i,j) - \frac{\alpha^{(j)} - 1}{2n} d_n ,$$

in which the difference $J_n(i,j) - I_n(i,j)$ depends on $\{i,j\}$, and the selection order may be different.

As a summary:

1. if both $X^{(i)}, X^{(j)}$ are finite: $J_n(i,j) = I_n(i,j) - \frac{(\alpha^{(i)} - 1)(\alpha^{(j)} - 1)}{2n} d_n$

2. if both $X^{(i)}, X^{(j)}$ are Gaussian: $J_n(i,j) = I_n(i,j) - \frac{1}{2n} d_n$

3. if $X^{(i)}$ is Gaussian and $X^{(j)}$ is finite: $J_n(i,j) = I_n(i,j) - \frac{\alpha^{(j)} - 1}{2n} d_n$

Therefore, if $X^{(i)}$ is Gaussian, we only need to set $\alpha^{(i)} = 2$ in the formula of case 1.
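
The three cases of the summary collapse into one penalty function once a Gaussian variable is treated as having $\alpha = 2$. The sketch below illustrates this; the names `GAUSSIAN` and `penalty`, the default $d_n = \log n$, and the $1/(2n)$ scaling are assumptions of this sketch.

```python
from math import log

GAUSSIAN = 2  # a Gaussian variable is treated as if it had alpha = 2


def penalty(alpha_i, alpha_j, n, d_n=None):
    """Penalty subtracted from I_n(i, j) to obtain J_n(i, j).

    alpha_i, alpha_j: number of values of each finite variable, or GAUSSIAN.
    d_n: penalty sequence; log(n) is assumed when omitted.
    """
    if d_n is None:
        d_n = log(n)
    return (alpha_i - 1) * (alpha_j - 1) * d_n / (2 * n)


both_finite = penalty(5, 2, n=100, d_n=2.0)           # case 1
both_gaussian = penalty(GAUSSIAN, GAUSSIAN, n=100, d_n=2.0)  # case 2
mixed = penalty(GAUSSIAN, 4, n=100, d_n=2.0)          # case 3
```

Because the penalty differs from pair to pair in cases 1 and 3, subtracting it can reorder the candidate edges, unlike in the purely Gaussian case.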



4 Concluding Remarks 

We extended the Chow-Liu algorithm to general random variables, and considered variants that take into account the complexity of the forest so that overestimation can be avoided in the general setting.

As future work, we can further consider ways to avoid overestimation in various other cases in addition to the finite and Gaussian cases.



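Appendix: proof of (9)

We arbitrarily fix $x^N \in \mathbb{R}^N$ and $\epsilon > 0$. For each rectangle $(D^{(1)},\dots,D^{(N)}) \subseteq R^N_\epsilon$, where

$$R^N_\epsilon := \left\{y^N \in \mathbb{R}^N \;\middle|\; \left|\frac{d\nu}{d\eta}(x^N) - \frac{d\nu}{d\eta}(y^N)\right| < \epsilon,\ \left|\frac{d\mu_{i,\pi(i)}}{d\mu_i d\mu_{\pi(i)}}(x^N) - \frac{d\mu_{i,\pi(i)}}{d\mu_i d\mu_{\pi(i)}}(y^N)\right| < \epsilon \text{ for } \pi(i) \neq 0\right\} ,$$

we have from Radon-Nikodym,

$$\nu(D^{(1)},\dots,D^{(N)}) \ge \inf_{y^N \in R^N_\epsilon} \frac{d\nu}{d\eta}(y^N)\,\eta(D^{(1)},\dots,D^{(N)}) \ge \eta(D^{(1)},\dots,D^{(N)})\left\{\frac{d\nu}{d\eta}(x^N) - \epsilon\right\}$$

and

$$\nu(D^{(1)},\dots,D^{(N)}) \le \sup_{y^N \in R^N_\epsilon} \frac{d\nu}{d\eta}(y^N)\,\eta(D^{(1)},\dots,D^{(N)}) \le \eta(D^{(1)},\dots,D^{(N)})\left\{\frac{d\nu}{d\eta}(x^N) + \epsilon\right\} ;$$

thus, if $\eta(D^{(1)},\dots,D^{(N)}) > 0$,

$$\frac{d\nu}{d\eta}(x^N) - \epsilon \le \frac{\nu(D^{(1)},\dots,D^{(N)})}{\eta(D^{(1)},\dots,D^{(N)})} \le \frac{d\nu}{d\eta}(x^N) + \epsilon .$$

Similarly, for $\pi(i) \neq 0$, if $\mu_i(D^{(i)})\,\mu_{\pi(i)}(D^{(\pi(i))}) > 0$,

$$\frac{d\mu_{i,\pi(i)}}{d\mu_i d\mu_{\pi(i)}}(x^N) - \epsilon \le \frac{\mu_{i,\pi(i)}(D^{(i)},D^{(\pi(i))})}{\mu_i(D^{(i)})\,\mu_{\pi(i)}(D^{(\pi(i))})} \le \frac{d\mu_{i,\pi(i)}}{d\mu_i d\mu_{\pi(i)}}(x^N) + \epsilon .$$

Since $x^N \in \mathbb{R}^N$ and $\epsilon > 0$ are arbitrary, (8) means (9). (We only need to consider $x^N \in \mathbb{R}^N$ such that there exists $(D^{(1)},\dots,D^{(N)}) \ni x^N$ satisfying $\eta(D^{(1)},\dots,D^{(N)}) > 0$ and $\mu_i(D^{(i)})\,\mu_{\pi(i)}(D^{(\pi(i))}) > 0$ for all $\epsilon > 0$.)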